Feature Selection for Ordinal Text Classification
Authors
Abstract
Ordinal classification (also known as ordinal regression) is a supervised learning task that consists of estimating the rating of a data item on a fixed, discrete rating scale. This problem is receiving increased attention from the sentiment analysis / opinion mining community, due to the importance of automatically rating large amounts of product review data in digital form. As in other supervised learning tasks such as binary or multiclass classification, feature selection is often needed in order to improve efficiency and to avoid overfitting. However, while feature selection has been extensively studied for other classification tasks, it has not been studied for ordinal classification. In this paper we present six novel feature selection methods that we have specifically devised for ordinal classification, and test them on two datasets of product review data against three methods previously known from the literature, using two learning algorithms from the “support vector regression” tradition. The experimental results show that all six proposed metrics largely outperform all three baseline techniques (and are more stable than them by an order of magnitude), on both datasets and for both learning algorithms.
Note: This is a revised and substantially extended version of a paper that appeared as (Baccianella et al., 2010). The order in which the authors are listed is purely alphabetical; each author has made an equally important contribution to this work.
Related articles
Information Gain Feature Selection for Ordinal Text Classification using Probability Re-distribution
This paper looks at feature selection for ordinal text classification. Typical applications are sentiment and opinion classification, where classes have relationships based on an ordinal scale. We show that standard feature selection using Information Gain (IG) fails to identify discriminatory features, particularly when they are distributed over multiple ordinal classes. This is because inter-...
Selecting Features for Ordinal Text Classification
We present four new feature selection methods for ordinal regression and test them against four different baselines on two large datasets of product reviews.
An Improved Flower Pollination Algorithm with AdaBoost Algorithm for Feature Selection in Text Documents Classification
In recent years, production of text documents has seen an exponential growth, which is the reason why their proper classification seems necessary for better access. One of the main problems of classifying text documents is working in high-dimensional feature space. Feature Selection (FS) is one of the ways to reduce the number of text attributes. So, working with a great bulk of the feature spa...
Using Micro-Documents for Feature Selection: The Case of Ordinal Text Classification
Most popular feature selection methods for text classification (TC) are based on binary information concerning the presence/absence of the feature in each training document. As such, these methods do not exploit term frequency information. In order to overcome this drawback we break down each training document of length k into k training “microdocuments”, each consisting of a single word occurr...
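The micro-document idea described in this summary can be illustrated with a minimal sketch. It assumes that each one-word micro-document inherits the rating of the document it came from, which the truncated summary does not state; the function names `to_microdocuments` and `presence_counts` and the toy reviews are hypothetical and are not taken from the paper.

```python
from collections import Counter


def to_microdocuments(tokens, label):
    """Split one labelled document of length k into k one-word micro-documents.

    Assumption: every micro-document keeps the label of its source document.
    """
    return [([tok], label) for tok in tokens]


def presence_counts(docs):
    """Count, per (word, label) pair, the documents in which the word is present."""
    counts = Counter()
    for words, label in docs:
        for w in set(words):  # presence/absence only, as in binary feature selection
            counts[(w, label)] += 1
    return counts


# Toy training set: (tokens, rating on a 1-5 scale); purely illustrative data.
training = [
    ("great great phone".split(), 5),
    ("bad phone".split(), 1),
]

microdocs = [md for tokens, label in training
             for md in to_microdocuments(tokens, label)]

doc_counts = presence_counts(training)     # "great" counted once for rating 5
micro_counts = presence_counts(microdocs)  # "great" counted twice for rating 5
```

On the original documents a presence-based count sees "great" once, while on the micro-documents the same count reflects that it occurs twice, so term frequency reaches a selection method that only looks at presence or absence.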